fix: Colab compatibility, NYC TLC dataset, new README and AGENTS.md #1
Open
CLAV88 wants to merge 1 commit into coder2j:main
- Remove os.environ SPARK_HOME cell from all notebooks
(hardcoded local Mac path breaks Colab — pip install pyspark
bundles its own binaries, no SPARK_HOME needed)
- Replace all ./data/ relative paths with /content/data/
(Colab resolves relative paths against a temp kernel directory
that does not exist — absolute paths required)
- Add mode('overwrite') to all df.write cells
  (the default write mode is 'error', which fails on every re-run
  after the first; re-running is the normal learner workflow)
- Add idempotent git clone guard in data bootstrap cells
(exit code 128 on re-run because destination directory already
exists — shutil.rmtree guard makes bootstrap safe to re-run)
- Replace synthetic sample data with NYC TLC Yellow Taxi dataset
(Jan 2023, ~3M rows, 19 columns including fare_amount, tip_amount,
payment_type, datetime fields — more representative of production
data engineering than 5-row synthetic samples)
- Rewrite README.md with stage structure, Colab setup instructions,
per-stage test questions, and embedded SVG diagrams
- Add AGENTS.md for AI-assisted learning discovery and teaching
walkthrough guidance
- Add assets/ folder with 4 SVG diagrams:
partition-vs-table, rdd-vs-dataframe, csv-vs-parquet,
groupby-vs-window
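
The path fix above can be wrapped in a small helper. A minimal sketch (the `data_path` name and the local fallback are illustrative additions; the notebooks themselves simply use `/content/data/` directly):

```python
import os
from pathlib import Path


def data_path(name: str) -> str:
    """Return an absolute path for a dataset file.

    Colab resolves relative paths like './data/x.csv' against a temp
    kernel directory that does not exist, so an absolute base under
    /content is required there; elsewhere, fall back to ./data in the
    current working directory.
    """
    base = Path("/content/data") if os.path.isdir("/content") else Path("data")
    return str((base / name).resolve())
```

`data_path("taxi.parquet")` then yields `/content/data/taxi.parquet` on Colab and an absolute path under the local working directory anywhere else.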
What this PR fixes
This PR makes the tutorial work correctly on Google Colab,
which is how most beginners will run it.
Problems fixed
Environment cell breaks Colab: the `os.environ` SPARK_HOME cell in every notebook hardcodes a local Mac path that does not exist on Colab's VM. Replaced with a `pip install pyspark` setup cell that works on any machine.
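
A setup cell of that kind can be sketched roughly as follows (the `ensure_package` helper is a hypothetical name, not from the PR; a Colab cell would typically just run `!pip install pyspark`):

```python
import importlib.util
import subprocess
import sys
from typing import Optional


def ensure_package(module_name: str, pip_name: Optional[str] = None) -> bool:
    """Install a package with pip only if it is not already importable.

    Returns True when the package was already present. Because
    `pip install pyspark` bundles Spark's own binaries, no SPARK_HOME
    environment variable is needed, on Colab or locally.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
    return False
```

The guard just avoids reinstalling on re-run; calling `ensure_package("pyspark")` is a no-op once PySpark is present.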
Relative data paths fail: all `./data/` paths replaced with `/content/data/` absolute paths that Colab can resolve.

Write cells fail on re-run: added `.mode("overwrite")` to all `df.write` cells. The default mode is `error`, which fails on every run after the first; re-running is the normal learner workflow.
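
The failure mode can be reproduced without a Spark runtime. A plain-Python analogue of the save-mode semantics (the `write_dataset` helper is illustrative, not Spark's API):

```python
import shutil
from pathlib import Path


def write_dataset(path: str, rows: list, mode: str = "error") -> None:
    """Plain-Python stand-in for Spark's DataFrameWriter save modes.

    mode='error' (Spark's default, a.k.a. 'errorifexists') refuses to
    touch an existing output directory, which is why a notebook write
    cell fails on every re-run; mode='overwrite' replaces the output
    and makes the cell idempotent.
    """
    out = Path(path)
    if out.exists():
        if mode == "overwrite":
            shutil.rmtree(out)
        else:  # mimic Spark's default behaviour
            raise FileExistsError(f"path {path} already exists")
    out.mkdir(parents=True)
    (out / "part-00000.txt").write_text("\n".join(rows))
```

In the notebooks, the actual fix is the one-line change from `df.write.parquet(...)` to `df.write.mode("overwrite").parquet(...)`.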
Clone fails on re-run: added an existence check before `git clone` calls. Without it, re-running any notebook throws exit code 128.
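
A sketch of such a guard, assuming a hypothetical `fresh_clone` helper (the PR's bootstrap cells use a `shutil.rmtree` guard; the exact code may differ):

```python
import shutil
import subprocess
from pathlib import Path


def fresh_clone(repo_url: str, dest: str) -> None:
    """Clone repo_url into dest, wiping any leftover copy first.

    `git clone` exits with code 128 when dest already exists and is
    non-empty, so removing it first makes a data bootstrap cell safe
    to re-run top to bottom.
    """
    target = Path(dest)
    if target.exists():
        shutil.rmtree(target)
    subprocess.check_call(["git", "clone", repo_url, str(target)])
```

Alternatively, the clone can be skipped entirely when the destination already holds the data; wiping and re-cloning trades a little time for a guaranteed-clean state.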
Dataset replacement
Replaced all synthetic sample data (5-20 rows) with the NYC TLC Yellow Taxi dataset (Jan 2023, ~3M rows, 19 columns). Real financial columns such as `fare_amount`, `tip_amount`, and `payment_type` make every exercise more meaningful and representative of production data work.
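
To illustrate the kind of exercise these columns enable, here is a tip-percentage aggregation computed over three invented rows rather than the real parquet file (per the TLC data dictionary, `payment_type` 1 is credit card and 2 is cash, which records no tips):

```python
from collections import defaultdict

# Tiny in-memory stand-in for three NYC TLC columns (values invented).
trips = [
    {"payment_type": 1, "fare_amount": 12.0, "tip_amount": 2.4},
    {"payment_type": 1, "fare_amount": 30.0, "tip_amount": 6.0},
    {"payment_type": 2, "fare_amount": 9.0, "tip_amount": 0.0},
]

# Average tip percentage per payment type: sum tips and fares per group,
# then divide -- the shape of a typical groupBy exercise on the dataset.
totals = defaultdict(lambda: [0.0, 0.0])
for t in trips:
    totals[t["payment_type"]][0] += t["tip_amount"]
    totals[t["payment_type"]][1] += t["fare_amount"]

tip_pct = {k: round(100 * tip / fare, 1) for k, (tip, fare) in totals.items()}
print(tip_pct)  # {1: 20.0, 2: 0.0}
```

The same aggregation in PySpark is a one-liner over ~3M real rows, which is exactly why the dataset swap makes the exercises feel like production work.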
New files
- `README.md`: rewritten with stage structure, Colab setup instructions, per-stage test questions, and embedded diagrams
- `AGENTS.md`: AI-assisted teaching guidance and a common-error reference for learners using AI tools to work through the tutorial
- `assets/`: 4 SVG diagrams illustrating key concepts

Tested on
Google Colab, Spark 4.0.2, pip-installed pyspark